ABSTRACT

With the multiplication of XML data sources, many XML
data warehouse models have been proposed to handle data
heterogeneity and complexity in a way relational data
warehouses fail to achieve. However, XML-native database
systems currently suffer from limited performance, both in
terms of manageable data volume and response time.
Fragmentation helps address both these issues. Derived
horizontal fragmentation is typically used in relational data
warehouses and can definitely be adapted to the XML context.
However, the number of fragments produced by classical
algorithms is difficult to control. In this paper, we propose
the use of a k-means-based fragmentation approach that allows
controlling the number of fragments through its k parameter.
We experimentally compare its efficiency to classical derived
horizontal fragmentation algorithms adapted to XML data
warehouses and show its superiority.

Categories and Subject Descriptors 

H.2 [Database Management]: Physical Design

General Terms 

Performance 

1. INTRODUCTION

XML data sources that are pertinent for decision support
are becoming increasingly common as XML becomes a
standard for representing complex business data [6]. However,
they bear specificities (e.g., heterogeneous number and
order of dimensions or complex measures in facts, ragged
dimension hierarchies, etc.) that would be intricate to handle
in a relational environment. Hence, many efforts toward
XML data warehousing have been made in the past few
years [13] [35] [39], as well as efforts for extending the XQuery
language with near On-Line Analytical Processing (OLAP)
capabilities, such as advanced grouping and aggregation
features [21] [37].




DOLAP'08, October 30, 2008, Napa Valley, California, USA. 
Copyright 2008 ACM 978-1-60558-250-4/08/10 ...$5.00. 



Jerome Darmont 

University of Lyon (ERIC Lyon 2) 
5 avenue Pierre Mendès-France
69676 Bron Cedex - France 
jerome.darmont@univ-lyon2.fr 



In this context, performance is a critical issue, since
XML-native database systems currently suffer from limited
performance, both in terms of manageable data volume and
response time for complex analytical queries. These issues are
typical of data warehouses and can be addressed by
fragmentation. Fragmentation consists in splitting a data set
into several fragments such that their combination yields
the original warehouse without information loss or addition.
Fragmentation can subsequently lead to distributing the
target data warehouse, e.g., on a data grid [16]. In the
relational context, derived horizontal fragmentation is
acknowledged as best suited to data warehouses, because it takes
decision-support query requirements into consideration and
avoids computing unnecessary join operations [5]. Several
approaches have also been proposed for XML data
fragmentation, but they do not take multidimensional architectures
(i.e., star-like schemas) into account.

In derived horizontal fragmentation, dimensions' primary
horizontal fragmentation is crucial. In the relational
context, two major algorithms address this issue: the predicate
construction [32] and the affinity-based [31] strategies.
However, both are automatic and the number of fragments is not
known in advance, while it is crucial to control it, especially
since distributing M fragments over N nodes with M > N
can become an issue in itself. Hence, we propose in this
paper the use of a k-means-based fragmentation approach
that allows controlling the number of fragments through its
k parameter. Our idea, which adapts a proposal from the
object-oriented domain [14] to XML warehouses, is to cluster
workload query predicates to produce primary horizontal
dimension fragments, with one fragment corresponding to one
cluster of predicates. Primary fragmentation is then derived
on facts. Queries including given predicates are executed
over the corresponding fragments only, instead of the whole
warehouse, and thus run faster.

The remainder of this paper is organized as follows. We
first discuss existing research related to relational data
warehouse, XML data and data mining-based fragmentation in
Section 2. Then, we present our k-means-based XML data
warehouse fragmentation approach in Section 3. We
experimentally compare its efficiency to classical derived
horizontal fragmentation algorithms adapted to XML data
warehouses and show its superiority in Section 4. Finally, we
conclude this paper and present future research directions
in Section 5.



2. RELATED WORK 

2.1 Fragmentation Definition 

There are three fragmentation types in the relational con- 
text [23]: vertical fragmentation, horizontal fragmentation 
and hybrid fragmentation. 

Vertical fragmentation splits a relation R into subrela- 
tions that are projections of R with respect to a subset of 
attributes. It consists in grouping together attributes that 
are frequently accessed by queries. Vertical fragments are 
built by projection. The original relation is reconstructed 
by joining the fragments. 

Horizontal fragmentation divides a relation into subsets 
of tuples using query predicates. It reduces query process- 
ing costs by minimizing the number of irrelevant accessed 
instances. Horizontal fragments are built by selection. The 
original relation is reconstructed by fragment union. A vari- 
ant, derived horizontal fragmentation, consists in partition- 
ing a relation with respect to predicates defined on another 
relation. 

Finally, hybrid fragmentation consists of either horizontal 
fragments that are subsequently vertically fragmented, or 
vertical fragments that are subsequently horizontally frag- 
mented. 
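
The three basic operations can be illustrated with a minimal Python sketch. The relations, attribute names and helper functions below are hypothetical and only serve to show projection (vertical), selection (horizontal) and selection through a foreign key (derived horizontal fragmentation).

```python
# Toy relations represented as lists of dicts (hypothetical data).
customers = [
    {"id": 1, "name": "Ann", "nation": 13},
    {"id": 2, "name": "Bob", "nation": 20},
    {"id": 3, "name": "Cid", "nation": 13},
]
sales = [
    {"customer_id": 1, "amount": 10},
    {"customer_id": 2, "amount": 25},
    {"customer_id": 3, "amount": 40},
]


def vertical_fragment(relation, key, attributes):
    """Projection on a subset of attributes; the key is kept for rejoining."""
    return [{a: t[a] for a in (key, *attributes)} for t in relation]


def horizontal_fragment(relation, predicate):
    """Selection of the tuples that satisfy a predicate."""
    return [t for t in relation if predicate(t)]


def derived_horizontal_fragment(relation, foreign_key, parent_fragment, parent_key):
    """Selection of the tuples that reference a tuple of a parent fragment."""
    kept = {t[parent_key] for t in parent_fragment}
    return [t for t in relation if t[foreign_key] in kept]


names = vertical_fragment(customers, "id", ["name"])
nation_13 = horizontal_fragment(customers, lambda t: t["nation"] == 13)
others = horizontal_fragment(customers, lambda t: t["nation"] != 13)
# Sales are fragmented with respect to a predicate defined on customers.
sales_13 = derived_horizontal_fragment(sales, "customer_id", nation_13, "id")
```

The union of nation_13 and others rebuilds the customers relation, which is the reconstruction property stated above for horizontal fragments.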

2.2 Data Warehouse Fragmentation 

Many research studies address the issue of fragmenting 
relational data warehouses either to efficiently process ana- 
lytical queries or to distribute the warehouse. 

To improve ad-hoc query performance, Datta et al.
exploit a vertical fragmentation of facts to build the Curio
index [15], while Golfarelli et al. apply the same fragmentation
on warehouse views [19]. Munneke et al. propose a
fragmentation methodology for a multidimensional database [30].
Fragmentation consists in deriving a global data cube into
fragments containing a subset of data. This process is
defined by the slice and dice operation. The authors also define
another fragmentation strategy, server, that removes one or
several dimensions from a hypercube to produce fragments
with fewer dimensions than the original data cube.
Bellatreche and Boukhalfa apply horizontal fragmentation to a
star-schema [5]. Their fragmentation strategy is based on
a query workload and exploits a genetic algorithm to
select a partitioning schema. This algorithm aims at choosing
an optimal fragmentation schema that minimizes query cost.
Finally, Wu and Buchmann recommend combining
horizontal and vertical fragmentation for query optimization [38].
A fact table can be horizontally partitioned with respect to
one or more dimensions. It can also be vertically partitioned
according to its dimensions, i.e., all the foreign keys to the
dimension tables are partitioned as separate tables.

To distribute a data warehouse, Noaman et al. exploit a
top-down strategy that uses horizontal fragmentation [32].
The authors propose an algorithm for deriving horizontal
fragments from the fact table based on queries that are
defined on all dimension tables. Finally, Wehrle et al.
propose to distribute and query a warehouse on a computing
grid [36]. They use derived horizontal fragmentation to split
the data warehouse and build a so-called block of chunks, a
data set defining a fragment.

In summary, these proposals generally exploit static
derived horizontal fragmentation to reduce irrelevant data
access rate and efficiently process join operations across
multiple relations [5] [32] [36]. In the literature, the prevalent
methods used for derived horizontal fragmentation are the
following [23].

• Predicate construction. This method fragments a
relation by using a complete and minimal set of
predicates [32]. Completeness means that two relation
instances belonging to the same fragment have the same
probability of being accessed by any query. Minimality
guarantees that there is no redundancy in predicates.

• Affinity-based fragmentation. This method is an
adaptation of vertical fragmentation methods to
horizontal fragmentation [31]. It is based on the predicate
affinity concept [30], where affinity defines query
frequency. Specific matrices (predicate usage and affinity
matrices) are exploited to cluster selection predicates.
A cluster is defined as a selection predicate cycle and
forms a dimension graph fragment.
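
To make the matrices used by the affinity-based method more concrete, here is a small Python sketch. It assumes, by analogy with attribute affinity in vertical fragmentation, that the affinity of two predicates is the summed frequency of the queries that use both; the usage matrix and the query frequencies are hypothetical, and the cycle-detection step of the method is not shown.

```python
def affinity_matrix(usage, frequencies):
    """Build a predicate affinity matrix.

    usage[q][p] is 1 when query q uses predicate p (predicate usage matrix);
    frequencies[q] is the access frequency of query q.
    Assumption: aff[i][j] sums the frequencies of queries using both i and j.
    """
    n = len(usage[0])
    aff = [[0] * n for _ in range(n)]
    for row, freq in zip(usage, frequencies):
        for i in range(n):
            for j in range(n):
                if row[i] and row[j]:
                    aff[i][j] += freq
    return aff


# Hypothetical usage of four predicates by three queries.
usage = [
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
]
frequencies = [5, 3, 2]  # hypothetical query frequencies
aff = affinity_matrix(usage, frequencies)
```

Predicates that never occur together (here the third and fourth ones) get a zero affinity, so a clustering step based on this matrix has no reason to group them.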

2.3 XML Database Fragmentation 

Recently, several fragmentation techniques for XML data
have been proposed. They split an XML document into a
new set of XML documents. Their main objective is either to
improve XML query performance [7] [18] [25] or to distribute
or exchange XML data over a network [9] [10].

To fragment XML documents, Ma et al. define a new
fragmentation type: split [24] [25], which is inspired by
the object-oriented domain. This fragmentation splits XML
document elements and assigns a reference to each
sub-element. The references are then added to the Document
Type Definition (DTD) defining the XML document. The
authors extend the DTD and XML-QL languages.
Bonifati et al. also propose a fragmentation strategy for XML
documents that is driven by structural constraints [7] [8].
This strategy uses both heuristics and statistics. Andrade
et al. propose to apply fragmentation to a homogeneous
XML collection [2]. They adapt traditional fragmentation
techniques to an XML document collection and base their
proposal on the Tree Logical Class (TLC) algebra [33]. The
authors also evaluate these techniques and show that
horizontal fragmentation provides the best performance.

Gertz and Bremer introduce a distribution approach for
an XML repository [18]. They propose a fragmentation
method and outline an allocation model for distributed XML
fragments in a centralized architecture. Gertz and Bremer
also define horizontal and vertical fragmentation for an XML
document. A fragment is defined with a path expression
language, called XF, which is derived from XPath. This
fragment is obtained by applying an XF expression on a graph
KG representing XML data. Moreover, the authors define
exclusion expressions that ensure fragment coherence and
disjointness.

Bose and Fegaras use XML fragments for data exchange
in a peer-to-peer (P2P) network, called XP2P [10]. XML
fragments are interrelated and each is uniquely identified by
an ID. The authors propose a fragmentation schema, called
Tag Structure, to define the structure of data and
fragmentation information. Bonifati et al. also define XML fragments
for a P2P framework [9]. An XML fragment is obtained and
identified by a single path expression, a root-to-node path
expression XP, and managed on a specific peer. In addition,
the authors associate two path expressions with each fragment:
super fragment and child fragment. These paths ensure the
identification of fragments and relationships.

In summary, these proposals adapt classical static
fragmentation methods to split XML data. An XML fragment
is defined and identified by a path expression [9] or an XML
algebra operator [2]. Fragmentation is performed on a single
XML document [24] [25] or on a homogeneous XML
collection [2]. Note that XML data warehouse fragmentation has
not been addressed yet, to the best of our knowledge.

2.4 Data Mining-based Fragmentation 

Although data mining has proved useful for selecting
physical data structures that enhance performance, such as
indexes or materialized views [3] [4], few approaches exploit
data mining for fragmentation.

Gorla and Betty exploit association rules (by adapting the
Apriori algorithm [1]) for a vertical fragmentation approach in
relational databases [20].

Darabant and Campan propose the horizontal
fragmentation method for object-oriented distributed databases, based
on k-means clustering, that inspires our work [14]. This method
clusters object instances into fragments by taking all
complex relationships between classes into account (aggregation,
associations and links induced by complex methods).

Finally, Fiolet and Toursel propose a parallel, progressive
clustering algorithm to fragment a database and distribute
it over a grid [17]. It is inspired by the CLIQUE
sequential clustering algorithm that consists in clustering data by
projection.

Though in limited number, these studies clearly demon- 
strate how data mining can be used for vertical and hori- 
zontal fragmentation, through association rule mining and 
clustering, respectively. They are also static, though. 

3. K-MEANS-BASED FRAGMENTATION 

Although XML data warehouse architectures from the lit- 
erature share a lot of concepts (mostly originating from clas- 
sical data warehousing), they are nonetheless all different. 
Hence, we proposed a unified, reference XML data ware- 
house model that synthesizes and enhances existing mod- 
els [29], and on which we base our fragmentation work. We 
first recall this model before detailing our fragmentation ap- 
proach. 

3.1 XML Warehouse Reference Model 

XML warehousing approaches assume that the warehouse
is composed of XML documents that represent both facts
and dimensions. All these studies mostly differ in the way
dimensions are handled and the number of XML documents
that are used to store facts and dimensions. A performance
evaluation study of these different representations showed
that representing facts in one single XML document and
each dimension in one XML document allowed the best
performance [12]. Moreover, this representation also allows
modeling constellation schemas without duplicating dimension
information. Several fact documents can indeed share the
same dimensions. Hence, we adopt this architecture model.
More precisely, our reference data warehouse is composed of
the following XML documents:

• dw-model.xml that represents warehouse metadata;

• a set of facts_f.xml documents that each store
information related to set of facts f;

• a set of dimension_d.xml documents that each store a
given dimension d's member values.

The dw-model.xml document (Figure 1) defines the
multidimensional structure of the warehouse. Its root node,
DW-model, is composed of two types of nodes: dimension
and FactDoc. A dimension node defines one dimension, its
possible hierarchical levels (Level elements) and attributes
(including their types), as well as the path to the
corresponding dimension_d.xml document. A FactDoc element defines
a fact, i.e., its measures, references to the corresponding
dimensions, and the path to the corresponding facts_f.xml
document.




Figure 1: dw-model.xml graph structure 

A facts_f.xml document stores facts (Figure 2(a)). The
document root node, FactDoc, is composed of fact
subelements that each instantiate a fact, i.e., measure values and
dimension references. These identifier-based references
support the fact-to-dimension relationship.

Finally, a dimension_d.xml document helps instantiate one
dimension, including any hierarchical level (Figure 2(b)). Its
root node, dimension, is composed of Level nodes. Each one
defines a hierarchy level composed of instance nodes that
each define the level's member attribute values. In
addition, an instance element contains Roll-up and Drill-Down
attributes that define the hierarchical relationship within
dimension d.
Figure 2: facts_f.xml (a) and dimension_d.xml (b)
graph structures



3.2 Fragmentation Approach 

3.2.1 Principle 

Since the aim of fragmentation is to optimize query
response time, the prevalent fragmentation strategies are
workload driven [5] [14] [31] [32]. More precisely, they exploit
selection predicates found in workload queries to derive
suitable fragments. Our approach also belongs to this family.
Its general principle is summarized in Figure 3. It
outputs both a fragmentation schema (metadata) and the
fragmented XML warehouse. It is subdivided into three steps
that are detailed in the following sections:

1. selection predicate extraction from the query work- 
load; 

2. predicate clustering with the k-means method; 

3. fragment construction with respect to predicate clus- 
ters. 

[Flowchart: XQuery workload and source XML data warehouse
(schema + data) as inputs; steps 1. selection predicate
extraction, 2. predicate clustering (k-means), 3. fragment
construction; fragmentation schema and XML warehouse
fragments as outputs]

Figure 3: K-means-based fragmentation principle

3.2.2 Selection Predicate Extraction 

Selection predicate set P is simply parsed from workload
W. For example, let Ws be the sample XQuery workload
provided in Figure 4 and Ps the corresponding predicate set.
Ps = {p1, p2, p3, p4, ...}, where:

p1 = $y/attribute[@id = "c_nation_key"]/@value > "15",
p2 = $y/attribute[@id = "c_nation_key"]/@value = "13",
p3 = $y/attribute[@id = "p_type"]/@value = "PBC" and
p4 = $y/attribute[@id = "d_date_name"]/@value = "Sat.".
For example, p2 and p3 are selection predicates obtained
from query q2 ∈ Ws.

Parsed predicates are then coded in a query-predicate
matrix QP whose general term QPij equals 1 if predicate
pj ∈ P appears in query qi ∈ W, and 0 otherwise. For
example, the QPs matrix corresponding to Ws and Ps is
featured in Table 1.
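
The extraction and coding steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the regular expression, the function names and the simplified where clauses are hypothetical, and predicates are assumed to always follow the `$var/attribute[@id="..."]/@value op "..."` shape of the running example.

```python
import re

# Matches selection predicates such as
#   $y/attribute[@id="c_nation_key"]/@value>"15"
# Join conditions (on .../dimension[...]/@value-id) are deliberately not matched.
PREDICATE = re.compile(
    r'\$\w+/attribute\[@id\s*=\s*"([^"]+)"\]/@value\s*(>=|<=|!=|=|>|<)\s*"([^"]+)"'
)


def extract_predicates(query):
    """Return the (attribute, operator, value) triples found in one query."""
    return [m.groups() for m in PREDICATE.finditer(query)]


def query_predicate_matrix(workload):
    """Return the ordered predicate list and the 0/1 query-predicate matrix."""
    predicates, per_query = [], []
    for query in workload:
        found = extract_predicates(query)
        per_query.append(found)
        for p in found:
            if p not in predicates:
                predicates.append(p)
    matrix = [[1 if p in found else 0 for p in predicates] for found in per_query]
    return predicates, matrix


# Simplified where clauses of the three sample queries.
workload = [
    'where $y/attribute[@id="c_nation_key"]/@value>"15" '
    'and $x/dimension[@dim-id="Customer"]/@value-id=$y/@id',
    'where $y/attribute[@id="c_nation_key"]/@value="13" '
    'and $y/attribute[@id="p_type"]/@value="PBC"',
    'where $y/attribute[@id="c_nation_key"]/@value="13" '
    'and $y/attribute[@id="d_date_name"]/@value="Sat."',
]
predicates, qp = query_predicate_matrix(workload)
```

Each row of qp corresponds to one query and each column to one predicate, in order of first appearance.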

3.2.3 Predicate Clustering 

Our objective is to derive fragments that optimize data
access for a given set of queries. Since horizontal fragments
are built from predicates, clustering predicates with respect
to queries achieves our goal. Intuitively, we ideally seek to
build rectangles (clusters) of 1s in matrix QP. We chose the
widely-used k-means algorithm [26] for clustering. This
algorithm inputs vectors of object attributes (columns of QP
for $x in //FactDoc/Fact,
    $y in //dimension[@dim-id="Customer"]/Level/instance
where $y/attribute[@id="c_nation_key"]/@value>"15"
  and $x/dimension[@dim-id="Customer"]/@value-id=$y/@id
return $x

for $x in //FactDoc/Fact,
    $y in //dimension[@dim-id="Customer"]/Level/instance,
    $z in //dimension[@dim-id="Part"]/Level/instance
where $y/attribute[@id="c_nation_key"]/@value="13"
  and $y/attribute[@id="p_type"]/@value="PBC"
  and $x/dimension[@dim-id="Customer"]/@value-id=$y/@id
  and $x/dimension[@dim-id="Part"]/@value-id=$z/@id
return $x

for $x in //FactDoc/Fact,
    $y in //dimension[@dim-id="Customer"]/Level/instance,
    $z in //dimension[@dim-id="Date"]/Level/instance
where $y/attribute[@id="c_nation_key"]/@value="13"
  and $y/attribute[@id="d_date_name"]/@value="Sat."
  and $x/dimension[@dim-id="Customer"]/@value-id=$y/@id
  and $x/dimension[@dim-id="Part"]/@value-id=$z/@id
return $x

Figure 4: XQuery workload snapshot Ws
Table 1: Sample query-predicate matrix QPs



in our case). It attempts to find the centers of natural
clusters in source data by minimizing total intra-cluster variance
$\sum_{i=1}^{k} \sum_{x_j \in C_i} (x_j - \mu_i)^2$, where $C_i$,
$i = 1, \ldots, k$, are the k output clusters and $\mu_i$ is the
centroid (mean point) of points $x_j \in C_i$. Let C be the set of
all clusters $C_i$.

Usually, having k as an input parameter is viewed as a 
drawback in clustering. In our case, this turns out to be 
an advantage, since we want to limit the number of clus- 
ters/fragments, typically to the number of nodes the XML 
data warehouse will be distributed on. 

In practice, we used the Weka [22] SimpleKMeans
implementation of k-means. SimpleKMeans uses the Euclidean
distance to compute distances between points and clusters.
It directly inputs matrix QP (actually, the pj vectors) and
k, and outputs set of predicate clusters C. For example, on
matrix QPs with k = 2, SimpleKMeans outputs:

Cs = {{p1}, {p2, p3, p4}}.
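
The clustering step can be reproduced on the running example with a small, self-contained k-means written in plain Python. This sketch is not the Weka implementation: it assumes a deterministic initialization (the first k vectors serve as initial centroids), whereas SimpleKMeans chooses its own initial centroids, and the three-query matrix below only covers the queries shown in Figure 4.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iter=100):
    """Lloyd's algorithm; returns the cluster index assigned to each point.

    Assumption: the first k points are used as initial centroids.
    """
    centroids = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Query-predicate matrix of the three sample queries (rows) and p1..p4 (columns).
qp = [
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
]
columns = [list(col) for col in zip(*qp)]  # one vector per predicate
labels = k_means(columns, 2)

clusters = {}
for name, label in zip(["p1", "p2", "p3", "p4"], labels):
    clusters.setdefault(label, []).append(name)
```

With this initialization, p1 ends up alone and p2, p3 and p4 are grouped together, which matches the cluster set Cs given above.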

3.2.4 Fragment Construction 

The fragmentation construction step is itself subdivided
into two substeps (Figure 5). First, predicate cluster set
C is joined to warehouse schema (document dw-model.xml)
to produce an XML document named frag-schema.xml that
represents the fragmentation schema (Figure 6). Its root
node, Schema, is composed of fragment elements. Each
fragment is identified by an @id attribute and contains
dimension elements. A dimension element is identified by a
@name attribute and contains predicate elements that store
the predicates used for fragmentation. For example, the
fragmentation schema frag-schema_S.xml corresponding to
cluster set Cs is provided in Figure 7.
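
As an illustration of this first substep, the following Python sketch builds a fragmentation schema from a cluster set with the standard xml.etree.ElementTree module. The function name and the predicate-to-dimension mapping are hypothetical; in the approach described here, that mapping comes from joining the clusters to dw-model.xml.

```python
import xml.etree.ElementTree as ET


def build_frag_schema(clusters, predicate_dimension):
    """Build a Schema element with one fragment element per predicate cluster."""
    schema = ET.Element("Schema")
    for number, cluster in enumerate(clusters, start=1):
        fragment = ET.SubElement(schema, "fragment", id=f"f{number}")
        dimensions = {}  # dimension name -> dimension element of this fragment
        for predicate in cluster:
            name = predicate_dimension[predicate]
            if name not in dimensions:
                dimensions[name] = ET.SubElement(fragment, "dimension", name=name)
            ET.SubElement(dimensions[name], "predicate", name=predicate)
    return schema


# Cluster set Cs and an assumed mapping of predicates to dimensions.
clusters = [["p1"], ["p2", "p3", "p4"]]
predicate_dimension = {"p1": "Customer", "p2": "Customer", "p3": "Part", "p4": "Date"}

schema = build_frag_schema(clusters, predicate_dimension)
xml_text = ET.tostring(schema, encoding="unicode")
```

Serializing the result gives a document with the same fragment, dimension and predicate nesting as the sample schema of the running example.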

In this process, we also output a set of XQueries (the 
<Schema>
  <fragment id="f1">
    <dimension name="Customer">
      <predicate name="p1" />
    </dimension>
  </fragment>
  <fragment id="f2">
    <dimension name="Customer">
      <predicate name="p2" />
    </dimension>
    <dimension name="Part">
      <predicate name="p3" />
    </dimension>
    <dimension name="Date">
      <predicate name="p4" />
    </dimension>
  </fragment>
</Schema>

Figure 7: Sample frag-schema_S.xml document



Figure 5: Fragment construction substeps 
Figure 6: frag-schema.xml graph structure

fragments.xq script) that, applied to the XML data
warehouse (i.e., the whole set of facts_f.xml and dimension_d.xml
documents), produces the actual fragments, which we store
in a set of facts_f_i.xml and dimension_d_i.xml documents,
i = 1, ..., k + 1. As fragments, these documents indeed
bear the same schema as the original warehouse. The
(k + 1)th fragment is based on an additional predicate,
denoted ELSE, which is the negation of the conjunction of
all predicates in P and is necessary to ensure
fragmentation completeness (Section 2.2). In our running example,
ELSE = ¬(p1 ∧ p2 ∧ p3 ∧ p4).

Figure 8 provides an excerpt from the fragments_S.xq script
that helps build fragment f2 from Figure 7. Dimension
fragments are generated first, one by one, through selections
exploiting the predicate(s) associated with the current dimension
(first three queries from Figure 8). Then, fragmentation is
derived on facts by joining the original fact document to the
newly-created dimension fragments (last query).
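
The two-stage construction (selection on a dimension, then derivation on facts) can be mimicked in Python on toy documents. The sketch below is an assumption-laden illustration, not the XQuery script itself: the documents are tiny hand-written samples that follow the reference model, the Fact id attribute is hypothetical and only added to make results easy to inspect, and a single dimension is handled.

```python
import xml.etree.ElementTree as ET

# Toy dimension and fact documents (hypothetical content).
DIMENSION = """<dimension dim-id="Customer"><Level id="Customers">
  <instance id="c1"><attribute id="c_nation_key" value="13"/></instance>
  <instance id="c2"><attribute id="c_nation_key" value="20"/></instance>
</Level></dimension>"""

FACTS = """<FactDoc>
  <Fact id="s1"><dimension dim-id="Customer" value-id="c1"/></Fact>
  <Fact id="s2"><dimension dim-id="Customer" value-id="c2"/></Fact>
  <Fact id="s3"><dimension dim-id="Customer" value-id="c1"/></Fact>
</FactDoc>"""


def fragment_dimension(dimension, attribute_id, value):
    """Primary fragmentation: keep instances whose attribute equals value."""
    kept = []
    for instance in dimension.iter("instance"):
        for attribute in instance.findall("attribute"):
            if attribute.get("id") == attribute_id and attribute.get("value") == value:
                kept.append(instance)
                break
    return kept


def derive_fact_fragment(facts, dim_id, instances):
    """Derived fragmentation: keep facts that reference a kept instance."""
    ids = {instance.get("id") for instance in instances}
    fragment = ET.Element("FactDoc")
    for fact in facts.iter("Fact"):
        for ref in fact.findall("dimension"):
            if ref.get("dim-id") == dim_id and ref.get("value-id") in ids:
                fragment.append(fact)
                break
    return fragment


dimension = ET.fromstring(DIMENSION)
facts = ET.fromstring(FACTS)
customer_fragment = fragment_dimension(dimension, "c_nation_key", "13")
fact_fragment = derive_fact_fragment(facts, "Customer", customer_fragment)
```

Only the facts that reference a customer of nation 13 survive in fact_fragment, so a query restricted to that predicate no longer scans the other facts.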

4. EXPERIMENTS 

Since derived horizontal fragmentation is an NP-hard
problem [11] solved by heuristics, we choose to validate our
proposal experimentally.

4.1 Experimental conditions 

We use XWeB (XML Data Warehouse Benchmark) [27]
as a test platform. XWeB is based on the reference model
defined in Section 3.1 and proposes a test XML data
warehouse and its associated XQuery decision-support workload.



element dimension { attribute dim-id {"Customer"}, element Level {
    attribute id {"Customers"},
    for $x in document("dimension_Customer.xml")//Level
    where $x//attribute[@id="c_nation_key"]/@value="13"
    return $x }
}

element dimension { attribute dim-id {"Part"}, element Level {
    attribute id {"Part"},
    for $x in document("dimension_Part.xml")//Level
    where $x//attribute[@id="p_type"]/@value="PBC"
    return $x }
}

element dimension { attribute dim-id {"Date"}, element Level {
    attribute id {"Date"},
    for $x in document("dimension_Date.xml")//Level
    where $x//attribute[@id="d_date_name"]/@value="Sat."
    return $x }
}

element FactDoc {
    for $x in //FactDoc/Fact,
        $y in document("dimension_Customer_f2.xml")//instance,
        $z in document("dimension_Part_f2.xml")//instance,
        $t in document("dimension_Date_f2.xml")//instance
    where $x/dimension[@dim-id="Customer"]/@value-id=$y/@id
      and $x/dimension[@dim-id="Part"]/@value-id=$z/@id
      and $x/dimension[@dim-id="Date"]/@value-id=$t/@id
    return $x
}

Figure 8: Excerpt from sample fragments_S.xq script

XWeB's warehouse consists of sale facts characterized by
the amount (of purchased products) and quantity (of
purchased products) measures. These facts are stored in the
facts_sales.xml document and are described by four
dimensions: Customer, Supplier, Date and Part stored in the
dimension_Customer.xml, dimension_Supplier.xml,
dimension_Date.xml and dimension_Part.xml documents,
respectively. XWeB's warehouse characteristics are displayed
in Table 2.

XWeB's workload is composed of queries that exploit the
warehouse through join and selection operations. We extend
this workload by adding queries and selection predicates in
order to obtain a significant fragmentation. Due to space
constraints, our workload is only available on-line¹.

We ran our tests on a Pentium 2 GHz PC with 1 GB of
main memory and an IDE hard drive under Windows XP.
We use the X-Hive XML native DBMS² to store and query
the warehouse. Our code is written in Java and connects


¹ http://eric.univ-lyon2.fr/~hmahboubi/Workload/workload.xq
² http://www.x-hive.com/products/db/



Facts                      Maximum number of cells
Sale facts                 7000

Dimensions                 Number of instances
Customer                   1000
Supplier                   1000
Date                       500
Part                       1000

Documents                  Size (MB)
facts_sales.xml            2.14
dimension_Customer.xml     0.431
dimension_Supplier.xml     0.485
dimension_Date.xml         0.104
dimension_Part.xml         0.388

Table 2: XWeB warehouse characteristics



to X-Hive and Weka through their respective Application 
Programming Interfaces (APIs). It is available on demand. 

4.2 Fragmentation Strategy Comparison 

In this first series of experiments, we aim at comparing our
k-means-based horizontal fragmentation approach (denoted
KM) to the classical derived horizontal fragmentation
techniques, namely by predicate construction (PC) and
affinity-based (AB) primary fragmentation (Section 2.2), which we
adapted to XML data warehouses [28]. We also record
performance when no fragmentation is applied (NF), for
reference.

4.2.1 Query Response Time 

This experiment measures workload execution time with
the three fragmentation strategies we adopted. For KM, we
arbitrarily fixed k = 8, which could correspond to a
computer cluster's size. The fragments we achieve are stored in
distinct collections to simulate data distribution. Each
collection can indeed be considered to be stored on a distinct
node and can be identified, targeted and queried separately.
To measure query execution time over a fragmented
warehouse, we first identify the required fragments with the
frag-schema.xml document. Then, we execute the query over
each fragment and save execution time. To simulate parallel
execution, we only consider maximum execution time.
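
The timing rule can be written down in a few lines of Python. The function names and the timings are hypothetical, and summing per-query times to obtain a workload time is an assumption made for this sketch only.

```python
def simulated_parallel_time(fragment_times):
    """Time of one query when its fragments are assumed to be queried in
    parallel: the slowest fragment determines the response time."""
    return max(fragment_times)


def workload_time(per_query_fragment_times):
    """Assumption: workload time is the sum of the simulated query times."""
    return sum(simulated_parallel_time(times) for times in per_query_fragment_times)


# Hypothetical per-fragment execution times (seconds) for two queries.
timings = [[1.0, 3.0], [2.0]]
total = workload_time(timings)
```

Under this rule, a query that touches many fragments is never faster than its slowest fragment, and distribution and result reconstruction costs come on top of the measured maximum.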

Figure 9 plots workload response time with respect to data
warehouse size (expressed in number of facts). It clearly
shows that fragmentation significantly improves response
time, and that KM fragmentation performs better than PC
and AB fragmentation when the warehouse scales up.
Workload execution time is indeed, on average, 86.5% faster
with KM fragmentation than with NF, and 36.7% faster with
KM than with AB. We believe our approach performs
better than classical derived horizontal fragmentation
techniques because the latter produce many more fragments
(159 with PC and 119 with AB vs. 9 with KM). Hence,
at workload execution time, queries must access many
fragments (up to 50 from our observations), which multiplies
query distribution and result reconstruction costs. The
number of accessed fragments is much lower with KM (typically
2 fragments in our experiments).



[Plot: workload response time with respect to the number of
facts, for the KM, NF, PC and AB strategies]

Figure 9: Fragmentation efficiency comparison

4.2.2 Fragmentation Overhead 

We also compare the PC, AB and KM (k = 8)
fragmentation strategies in terms of overhead (i.e., fragmentation
algorithm execution time). When assessing performance, it
is indeed necessary to find a fair trade-off between gain and
overhead. Table 3 summarizes the results we obtain for an
arbitrarily fixed data warehouse size of 3,000 facts. It shows
that KM clearly outperforms AB and PC, which is in line
with these algorithms' complexities: $O(|P|)$, $O(|P|^2)$ and
$O(2^{|P|})$, respectively. While AB and PC would have to run
off-line, KM could on the other hand be envisaged to run
on-line.





                      PC      AB      KM
Execution time (h)    16.8    11.9    0.25

Table 3: Fragmentation overhead comparison



4.3 Influence of Number of Clusters 

In this experiment, we fixed data warehouse size (to 4,000
and 5,000 facts, respectively) and varied KM parameter k to
observe its influence on workload response time. Figure 10
confirms that performance improves quickly when
fragmentation is applied, but tends to degrade when the number of
fragments increases, as we explained in Section 4.2.1.
Furthermore, it hints that an optimal number of clusters for
our test data warehouse and workload lies between 4 and
6, making us conclude that over-fragmentation must be
detected and avoided. Note that, in Figure 10, k = 1
corresponds to the NF experiment (this one fragment is the
original warehouse).

5. CONCLUSION 

In this paper, we have introduced an approach for
fragmenting XML data warehouses that is based on data
mining, and more precisely clustering and the k-means
algorithm. Classical derived horizontal fragmentation strategies
run automatically and output an unpredictable number of
fragments, which is nonetheless crucial to keep under
control. By contrast, our approach allows fully controlling the
Figure 10: Influence of number of clusters

number of fragments through the k-means k parameter. 

To validate our proposal, we have compared our fragmen- 
tation strategy to XML adaptations of the two prevalent 
fragmentation methods for relational data warehouses. Our 
experimental results show that our approach, by producing 
a lower number of fragments, outperforms both other methods
in terms of performance gain and overhead.

Now that we have efficiently fragmented an XML data 
warehouse, our most direct perspective is to distribute it on
a data grid. This raises several issues that include processing 
a global query into subqueries to be sent to the right nodes 
in the grid, and reconstructing a global result from sub- 
query results. Properly indexing the distributed warehouse 
to guarantee good performance shall also be very important. 

Finally, in a continuous effort to minimize the data
warehouse administration function and aim at
autoadministrative systems [3] [4], we plan to make our data mining-based
fragmentation strategy dynamic. The idea is to perform
incremental fragmentation when the warehouse is refreshed.
This could be achieved with the help of an incremental
variant of the k-means algorithm [34].